{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CHAPTER 5 Getting Started with pandas\n",
    "\n",
    "这一节终于要开始讲pandas了。闲话不说，直接开始正题。之后的笔记里，这样导入pandas："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "另外可以导入Series和DataFrame，因为这两个经常被用到："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pandas import Series, DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.1 Introduction to pandas Data Structures\n",
    "\n",
    "数据结构其实就是Series和DataFrame。\n",
    "\n",
    "# 1 Series\n",
    "\n",
    "这里series我就不翻译成序列了，因为之前的所有笔记里，我都是把sequence翻译成序列的。\n",
    "\n",
    "series是一个像数组一样的一维序列，并伴有一个数组表示label，叫做index。创建一个series的方法也很简单："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    4\n",
       "1    7\n",
       "2   -5\n",
       "3    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj = pd.Series([4, 7, -5, 3])\n",
    "obj"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到，左边表示index，右边表示对应的value。可以通过value和index属性查看："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 4,  7, -5,  3])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RangeIndex(start=0, stop=4, step=1)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj.index # like range(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然我们也可以自己指定index的label："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "d    4\n",
       "b    7\n",
       "a   -5\n",
       "c    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['d', 'b', 'a', 'c'], dtype='object')"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj2.index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以用index的label来选择："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-5"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj2['a']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "obj2['d'] = 6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "c    3\n",
       "a   -5\n",
       "d    6\n",
       "dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj2[['c', 'a', 'd']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里['c', 'a', 'd']其实被当做了索引，尽管这个索引是用string构成的。\n",
    "\n",
    "使用numpy函数或类似的操作，会保留index-value的关系："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "d    6\n",
       "b    7\n",
       "c    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj2[obj2 > 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "d    12\n",
       "b    14\n",
       "a   -10\n",
       "c     6\n",
       "dtype: int64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj2 * 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "d     403.428793\n",
       "b    1096.633158\n",
       "a       0.006738\n",
       "c      20.085537\n",
       "dtype: float64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "np.exp(obj2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "另一种看待series的方法，它是一个长度固定，有顺序的dict，从index映射到value。在很多场景下，可以当做dict来用："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'b' in obj2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'e' in obj2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "还可以直接用现有的dict来创建series："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon':16000, 'Utah': 5000}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ohio      35000\n",
       "Oregon    16000\n",
       "Texas     71000\n",
       "Utah       5000\n",
       "dtype: int64"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj3 = pd.Series(sdata)\n",
    "obj3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "series中的index其实就是dict中排好序的keys。我们也可以传入一个自己想要的顺序："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "states = ['California', 'Ohio', 'Oregon', 'Texas']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "California        NaN\n",
       "Ohio          35000.0\n",
       "Oregon        16000.0\n",
       "Texas         71000.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj4 = pd.Series(sdata, index=states)\n",
    "obj4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "顺序是按states里来的，但因为没有找到california,所以是NaN。NaN表示缺失数据，用之后我们提到的话就用missing或NA来指代。pandas中的isnull和notnull函数可以用来检测缺失数据："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "California     True\n",
       "Ohio          False\n",
       "Oregon        False\n",
       "Texas         False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.isnull(obj4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "California    False\n",
       "Ohio           True\n",
       "Oregon         True\n",
       "Texas          True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.notnull(obj4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "series也有对应的方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "California     True\n",
       "Ohio          False\n",
       "Oregon        False\n",
       "Texas         False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj4.isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "关于缺失数据，在第七章还会讲得更详细一些。\n",
    "\n",
    "series中一个有用的特色自动按index label来排序（Data alignment features）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ohio      35000\n",
       "Oregon    16000\n",
       "Texas     71000\n",
       "Utah       5000\n",
       "dtype: int64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "California        NaN\n",
       "Ohio          35000.0\n",
       "Oregon        16000.0\n",
       "Texas         71000.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "California         NaN\n",
       "Ohio           70000.0\n",
       "Oregon         32000.0\n",
       "Texas         142000.0\n",
       "Utah               NaN\n",
       "dtype: float64"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj3 + obj4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个Data alignment features（数据对齐特色）和数据库中的join相似。\n",
    "\n",
    "serice自身和它的index都有一个叫name的属性，这个能和其他pandas的函数进行整合："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "obj4.name = 'population'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "obj4.index.name = 'state'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "state\n",
       "California        NaN\n",
       "Ohio          35000.0\n",
       "Oregon        16000.0\n",
       "Texas         71000.0\n",
       "Name: population, dtype: float64"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "series的index能被直接更改："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    4\n",
       "1    7\n",
       "2   -5\n",
       "3    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Bob      4\n",
       "Steve    7\n",
       "Jeff    -5\n",
       "Ryan     3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']\n",
    "obj"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2 DataFrame\n",
    "\n",
    "DataFrame表示一个长方形表格，并包含排好序的列，每一列都可以是不同的数值类型（数字，字符串，布尔值）。DataFrame有行索引和列索引（row index, column index）；可以看做是分享所有索引的由series组成的字典。数据是保存在一维以上的区块里的。\n",
    "\n",
    "（其实我是把dataframe当做excel里的那种表格来用的，这样感觉更直观一些）\n",
    "\n",
    "构建一个dataframe的方法，用一个dcit，dict里的值是list：\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pop</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.5</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>2000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.7</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.6</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>2002</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2.4</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.9</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2002</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>3.2</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2003</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pop   state  year\n",
       "0  1.5    Ohio  2000\n",
       "1  1.7    Ohio  2001\n",
       "2  3.6    Ohio  2002\n",
       "3  2.4  Nevada  2001\n",
       "4  2.9  Nevada  2002\n",
       "5  3.2  Nevada  2003"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], \n",
    "        'year': [2000, 2001, 2002, 2001, 2002, 2003], \n",
    "        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}\n",
    "\n",
    "frame = pd.DataFrame(data)\n",
    "\n",
    "frame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "dataframe也会像series一样，自动给数据赋index, 而列则会按顺序排好。\n",
    "\n",
    "对于一个较大的DataFrame，用head方法会返回前5行（注：这个函数在数据分析中经常使用，用来查看表格里有什么东西）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pop</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.5</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>2000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.7</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.6</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>2002</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2.4</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.9</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2002</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pop   state  year\n",
       "0  1.5    Ohio  2000\n",
       "1  1.7    Ohio  2001\n",
       "2  3.6    Ohio  2002\n",
       "3  2.4  Nevada  2001\n",
       "4  2.9  Nevada  2002"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果指定一列的话，会自动按列排序："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>state</th>\n",
       "      <th>pop</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2000</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2001</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2002</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>3.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2001</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2002</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2003</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>3.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   year   state  pop\n",
       "0  2000    Ohio  1.5\n",
       "1  2001    Ohio  1.7\n",
       "2  2002    Ohio  3.6\n",
       "3  2001  Nevada  2.4\n",
       "4  2002  Nevada  2.9\n",
       "5  2003  Nevada  3.2"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(data, columns=['year', 'state', 'pop'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果你导入一个不存在的列名，那么会显示为缺失数据："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], \n",
    "                      index=['one', 'two', 'three', 'four', 'five', 'six'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>state</th>\n",
       "      <th>pop</th>\n",
       "      <th>debt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>one</th>\n",
       "      <td>2000</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>2001</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.7</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>three</th>\n",
       "      <td>2002</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>3.6</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>four</th>\n",
       "      <td>2001</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.4</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>five</th>\n",
       "      <td>2002</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.9</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>six</th>\n",
       "      <td>2003</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>3.2</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       year   state  pop debt\n",
       "one    2000    Ohio  1.5  NaN\n",
       "two    2001    Ohio  1.7  NaN\n",
       "three  2002    Ohio  3.6  NaN\n",
       "four   2001  Nevada  2.4  NaN\n",
       "five   2002  Nevada  2.9  NaN\n",
       "six    2003  Nevada  3.2  NaN"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['year', 'state', 'pop', 'debt'], dtype='object')"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame2.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从DataFrame里提取一列的话会返回series格式，可以以属性或是dict一样的形式来提取："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "one        Ohio\n",
       "two        Ohio\n",
       "three      Ohio\n",
       "four     Nevada\n",
       "five     Nevada\n",
       "six      Nevada\n",
       "Name: state, dtype: object"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame2['state']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "one      2000\n",
       "two      2001\n",
       "three    2002\n",
       "four     2001\n",
       "five     2002\n",
       "six      2003\n",
       "Name: year, dtype: int64"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame2.year"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注意：frame2[column]能应对任何列名，但frame2.column的情况下，列名必须是有效的python变量名才行。\n",
    "\n",
    "返回的series有DataFrame种同样的index，而且name属性也是对应的。\n",
    "\n",
    "对于行，要用在loc属性里用 位置或名字："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "year     2002\n",
       "state    Ohio\n",
       "pop       3.6\n",
       "debt      NaN\n",
       "Name: three, dtype: object"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame2.loc['three']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "列值也能通过赋值改变。比如给debt赋值："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>state</th>\n",
       "      <th>pop</th>\n",
       "      <th>debt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>one</th>\n",
       "      <td>2000</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.5</td>\n",
       "      <td>16.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>2001</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.7</td>\n",
       "      <td>16.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>three</th>\n",
       "      <td>2002</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>3.6</td>\n",
       "      <td>16.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>four</th>\n",
       "      <td>2001</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.4</td>\n",
       "      <td>16.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>five</th>\n",
       "      <td>2002</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.9</td>\n",
       "      <td>16.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>six</th>\n",
       "      <td>2003</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>3.2</td>\n",
       "      <td>16.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       year   state  pop  debt\n",
       "one    2000    Ohio  1.5  16.5\n",
       "two    2001    Ohio  1.7  16.5\n",
       "three  2002    Ohio  3.6  16.5\n",
       "four   2001  Nevada  2.4  16.5\n",
       "five   2002  Nevada  2.9  16.5\n",
       "six    2003  Nevada  3.2  16.5"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame2['debt'] = 16.5\n",
    "frame2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>state</th>\n",
       "      <th>pop</th>\n",
       "      <th>debt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>one</th>\n",
       "      <td>2000</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>2001</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.7</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>three</th>\n",
       "      <td>2002</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>3.6</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>four</th>\n",
       "      <td>2001</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.4</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>five</th>\n",
       "      <td>2002</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.9</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>six</th>\n",
       "      <td>2003</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>3.2</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       year   state  pop  debt\n",
       "one    2000    Ohio  1.5   0.0\n",
       "two    2001    Ohio  1.7   1.0\n",
       "three  2002    Ohio  3.6   2.0\n",
       "four   2001  Nevada  2.4   3.0\n",
       "five   2002  Nevada  2.9   4.0\n",
       "six    2003  Nevada  3.2   5.0"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame2['debt'] = np.arange(6.)\n",
    "frame2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果把list或array赋给column的话，长度必须符合DataFrame的长度。如果把一二series赋给DataFrame，会按DataFrame的index来赋值，不够的地方用缺失数据来表示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>state</th>\n",
       "      <th>pop</th>\n",
       "      <th>debt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>one</th>\n",
       "      <td>2000</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>2001</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.7</td>\n",
       "      <td>-1.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>three</th>\n",
       "      <td>2002</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>3.6</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>four</th>\n",
       "      <td>2001</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.4</td>\n",
       "      <td>-1.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>five</th>\n",
       "      <td>2002</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.9</td>\n",
       "      <td>-1.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>six</th>\n",
       "      <td>2003</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>3.2</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       year   state  pop  debt\n",
       "one    2000    Ohio  1.5   NaN\n",
       "two    2001    Ohio  1.7  -1.2\n",
       "three  2002    Ohio  3.6   NaN\n",
       "four   2001  Nevada  2.4  -1.5\n",
       "five   2002  Nevada  2.9  -1.7\n",
       "six    2003  Nevada  3.2   NaN"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])\n",
    "frame2['debt'] = val\n",
    "frame2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果列不存在，赋值会创建一个新列。而del也能像删除字典关键字一样，删除列："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>state</th>\n",
       "      <th>pop</th>\n",
       "      <th>debt</th>\n",
       "      <th>eastern</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>one</th>\n",
       "      <td>2000</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>2001</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>1.7</td>\n",
       "      <td>-1.2</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>three</th>\n",
       "      <td>2002</td>\n",
       "      <td>Ohio</td>\n",
       "      <td>3.6</td>\n",
       "      <td>NaN</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>four</th>\n",
       "      <td>2001</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.4</td>\n",
       "      <td>-1.5</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>five</th>\n",
       "      <td>2002</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>2.9</td>\n",
       "      <td>-1.7</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>six</th>\n",
       "      <td>2003</td>\n",
       "      <td>Nevada</td>\n",
       "      <td>3.2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       year   state  pop  debt eastern\n",
       "one    2000    Ohio  1.5   NaN    True\n",
       "two    2001    Ohio  1.7  -1.2    True\n",
       "three  2002    Ohio  3.6   NaN    True\n",
       "four   2001  Nevada  2.4  -1.5   False\n",
       "five   2002  Nevada  2.9  -1.7   False\n",
       "six    2003  Nevada  3.2   NaN   False"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame2['eastern'] = frame2.state == 'Ohio'\n",
    "frame2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后用del删除这一列："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "del frame2['eastern']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['year', 'state', 'pop', 'debt'], dtype='object')"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame2.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注意：columns返回的是一个view，而不是新建了一个copy。因此，任何对series的改变，会反映在DataFrame上。除非我们用copy方法来新建一个。\n",
    "\n",
    "另一种常见的格式是dict中的dict："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pop = {'Nevada': {2001: 2.4, 2002: 2.9},\n",
    "       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "把上面这种嵌套dcit传给DataFrame，pandas会把外层dcit的key当做列，内层key当做行索引："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Nevada</th>\n",
       "      <th>Ohio</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2000</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2001</th>\n",
       "      <td>2.4</td>\n",
       "      <td>1.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2002</th>\n",
       "      <td>2.9</td>\n",
       "      <td>3.6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      Nevada  Ohio\n",
       "2000     NaN   1.5\n",
       "2001     2.4   1.7\n",
       "2002     2.9   3.6"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame3 = pd.DataFrame(pop)\n",
    "frame3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "另外DataFrame也可以向numpy数组一样做转置："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>2000</th>\n",
       "      <th>2001</th>\n",
       "      <th>2002</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Nevada</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2.4</td>\n",
       "      <td>2.9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Ohio</th>\n",
       "      <td>1.5</td>\n",
       "      <td>1.7</td>\n",
       "      <td>3.6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        2000  2001  2002\n",
       "Nevada   NaN   2.4   2.9\n",
       "Ohio     1.5   1.7   3.6"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame3.T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "指定index："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Nevada</th>\n",
       "      <th>Ohio</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2001</th>\n",
       "      <td>2.4</td>\n",
       "      <td>1.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2002</th>\n",
       "      <td>2.9</td>\n",
       "      <td>3.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2003</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      Nevada  Ohio\n",
       "2001     2.4   1.7\n",
       "2002     2.9   3.6\n",
       "2003     NaN   NaN"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(pop, index=[2001, 2002, 2003])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "series组成的dict："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pdata = {'Ohio': frame3['Ohio'][:-1],\n",
    "         'Nevada': frame3['Nevada'][:2]}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Nevada</th>\n",
       "      <th>Ohio</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2000</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2001</th>\n",
       "      <td>2.4</td>\n",
       "      <td>1.7</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      Nevada  Ohio\n",
       "2000     NaN   1.5\n",
       "2001     2.4   1.7"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(pdata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "其他一些可以传递给DataFrame的构造器：\n",
    "\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/yv7rc.png)\n",
    "\n",
    "如果DataFrame的index和column有自己的name属性，也会被显示：\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "frame3.index.name = 'year'; frame3.columns.name = 'state'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>state</th>\n",
       "      <th>Nevada</th>\n",
       "      <th>Ohio</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>year</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2000</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2001</th>\n",
       "      <td>2.4</td>\n",
       "      <td>1.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2002</th>\n",
       "      <td>2.9</td>\n",
       "      <td>3.6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "state  Nevada  Ohio\n",
       "year               \n",
       "2000      NaN   1.5\n",
       "2001      2.4   1.7\n",
       "2002      2.9   3.6"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "values属性会返回二维数组："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ nan,  1.5],\n",
       "       [ 2.4,  1.7],\n",
       "       [ 2.9,  3.6]])"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame3.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果column有不同的类型，dtype会适应所有的列："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[2000, 'Ohio', 1.5, nan],\n",
       "       [2001, 'Ohio', 1.7, -1.2],\n",
       "       [2002, 'Ohio', 3.6, nan],\n",
       "       [2001, 'Nevada', 2.4, -1.5],\n",
       "       [2002, 'Nevada', 2.9, -1.7],\n",
       "       [2003, 'Nevada', 3.2, nan]], dtype=object)"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame2.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3 Index Objects (索引对象)\n",
    "\n",
    "pandas的Index Objects (索引对象)负责保存axis labels和其他一些数据（比如axis name或names）。一个数组或其他一个序列标签，只要被用来做构建series或DataFrame，就会被自动转变为index："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "obj = pd.Series(range(3), index=['a', 'b', 'c'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['a', 'b', 'c'], dtype='object')"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index = obj.index\n",
    "index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['b', 'c'], dtype='object')"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index[1:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "index object是不可更改的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "Index does not support mutable operations",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-67-676fdeb26a68>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mindex\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'd'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/pandas/indexes/base.py\u001b[0m in \u001b[0;36m__setitem__\u001b[0;34m(self, key, value)\u001b[0m\n\u001b[1;32m   1243\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1244\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0m__setitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1245\u001b[0;31m         \u001b[0;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Index does not support mutable operations\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1246\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1247\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mTypeError\u001b[0m: Index does not support mutable operations"
     ]
    }
   ],
   "source": [
    "index[1] = 'd'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "正因为不可修改，所以data structure中分享index object是很安全的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Int64Index([0, 1, 2], dtype='int64')"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "labels = pd.Index(np.arange(3))\n",
    "labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.5\n",
       "1   -2.5\n",
       "2    0.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj2 = pd.Series([1.5, -2.5, 0], index=labels)\n",
    "obj2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj2.index is labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "index除了想数组，还能像大小一定的set："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>state</th>\n",
       "      <th>Nevada</th>\n",
       "      <th>Ohio</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>year</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2000</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2001</th>\n",
       "      <td>2.4</td>\n",
       "      <td>1.7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2002</th>\n",
       "      <td>2.9</td>\n",
       "      <td>3.6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "state  Nevada  Ohio\n",
       "year               \n",
       "2000      NaN   1.5\n",
       "2001      2.4   1.7\n",
       "2002      2.9   3.6"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Nevada', 'Ohio'], dtype='object', name='state')"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame3.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'Ohio' in frame3.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "2003 in frame3.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与python里的set不同，pandas的index可以有重复的labels："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['foo', 'foo', 'bar', 'bar'], dtype='object')"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])\n",
    "dup_labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这种重复的标签中选择的话，会选中所有相同的标签。\n",
    "\n",
    "Index还有一些方法和属性：\n",
    "\n",
    "![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/14j6g.png)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [py35]",
   "language": "python",
   "name": "Python [py35]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
